Abstract


This paper shows that an $N$-node AKS network (as described by Paterson) can be embedded in a $(3N/2)$-node
degree-8 multibutterfly network with load 1, congestion 1, and dilation 2. The result has several implications,
including the first deterministic algorithms for sorting and finding the median of $n \log n$ keys on an $n$-input
multibutterfly in $O(\log n)$ time, a work-efficient algorithm for finding the median of $n \log^2 n \log\log n$ keys
on an $n$-input multibutterfly in $O(\log n \log\log n)$ time, and a three-dimensional VLSI layout for the
$n$-input AKS network with volume $O(n^{3/2})$. While these algorithms are not practical, they provide further
evidence of the robustness of multibutterfly networks. We also present a separate, and more practical,
deterministic algorithm for routing $h$ relations on an $n$-input multibutterfly in $O(h + \log n)$ time. Previously,
only algorithms for solving $h$ one-to-one routing problems were known. Finally, we show that a 2-folded
butterfly, whose individual splitters do not exhibit expansion, can emulate a bounded-degree multibutterfly
with an $(\alpha, \beta)$-expansion property, for any $\alpha \cdot \beta < 1/4$.


1  Introduction 


In 1983, Ajtai, Komlós, and Szemerédi devised a network for sorting $n$ keys in $O(\log n)$ depth [1]. This
result was surprising because no improvement in the asymptotic depth of sorting networks had been made
since Batcher's invention of the $O(\log^2 n)$-depth bitonic sorting network 15 years earlier [4]. Indeed, the
difficulty of improving on Batcher's construction led Knuth to conjecture that there was no sorting network
with depth $O(\log n)$ [18, p. 243].

The  AKS  sorting  network  differed  from  previous  constructions  in  one  crucial  aspect:  it  incorporated 
expansion into its structure. Expansion is a graph-theoretic notion. An $l \times r$ bipartite graph is said to be an
$(\alpha, \beta)$-expander if every set of $k$ nodes on the left side has at least $\beta k$ neighbors on the right side, provided
that $k \le \alpha l$, where $\alpha$ and $\beta$ are constants, $\alpha < 1$, and $\beta > 1$. As it happens, a random graph is likely to be an
expander  [29].  There  are  also  explicit  constructions  of  expanders.  These  constructions  were  first  discovered 
by  Margulis  [24,  25],  and  have  since  been  greatly  improved.  So  far,  however,  the  expansion  achieved  by  the 
explicit  constructions  is  still  about  a  factor  of  two  smaller  than  the  expected  expansion  of  a  random  graph. 
A  nice  summary  of  the  state  of  the  art  in  expander  graphs  can  be  found  in  [17]. 

One drawback to the AKS network is that the big-O notation hides large constant factors. In contrast,
the depth of the bitonic sorting network is $(\log^2 n)/2 + (\log n)/2$ [11, p. 650]. Some progress has been made
in simplifying the AKS network and in improving the constant factors in its depth [28], but for practical
values of $n$, the depth of bitonic sort is much smaller. To date, however, all $O(\log n)$-depth sorting networks
are based on the AKS construction.
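To make the comparison concrete, the following sketch evaluates the bitonic depth formula quoted above and finds where it would overtake a depth of the form $c \log n$. The constant `c` is purely hypothetical (the text gives no value for the AKS constant), and the function names are illustrative only.

```python
def bitonic_depth(n: int) -> int:
    """Depth (log^2 n)/2 + (log n)/2 of the bitonic sorting network, n a power of two."""
    k = n.bit_length() - 1          # k = log2(n)
    if n < 2 or 2 ** k != n:
        raise ValueError("n must be a power of two, n >= 2")
    return k * (k + 1) // 2         # same value as (k^2 + k) / 2


def crossover_log_n(c: float) -> int:
    """Smallest k = log2(n) with k(k+1)/2 > c*k, for a HYPOTHETICAL constant c."""
    k = 1
    while k * (k + 1) / 2 <= c * k:
        k += 1
    return k
```

For instance, under the assumed value `c = 100`, a depth of `c * log n` would only win once `log n` reaches 200, which illustrates why bitonic sort is shallower for practical values of $n$.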

Two notable AKS-based sorting networks are Leighton's sorting network [19] and Ma's fault-tolerant
sorting network [23]. Leighton shows how to construct an $N$-node degree-3 network capable of sorting $N$
keys in $O(\log N)$ steps. His network implements the columnsort algorithm, and uses a $\Theta(N/\log N)$-input
AKS network in a pipelined fashion. Ma shows how to construct an $n$-input sorting network with $O(\log n)$
depth that can sustain constant-probability passive faults at its comparators, and still sort correctly with
high probability. In the passive fault model, a faulty comparator can be viewed as having been removed
from the network.

Another  network  that  incorporates  expansion  into  its  structure  is  the  multibutterfly.  The  basic  structure  of 
this  network  was  introduced  by  Bassalygo  and  Pinsker  [3],  who  showed  that  two  back-to-back  multibutterflies 
form an $O(\log n)$-depth nonblocking network. Here $n$ is the number of input and output terminals of the
network.  A  network  is  called  nonblocking  if  every  unused  input  terminal  can  be  connected  by  a  path  through 
unused  edges  (or  nodes)  to  any  unused  output  terminal,  regardless  of  which  inputs  and  outputs  have  already 
been  connected.  Bassalygo  and  Pinsker  did  not  use  the  term  multibutterfly,  and  their  network  differed  from 
the  multibutterflies  considered  in  the  rest  of  this  paper  in  one  technical  detail:  although  the  out-degree  of 
each  node  in  the  network  was  bounded,  the  in-degree  was  not  necessarily  so.  It  is  not  difficult,  however, 
to  modify  their  construction  so  that  the  degree  of  all  nodes  is  bounded;  they  probably  did  not  consider  it 
important. 

The term "multibutterfly" was introduced by Upfal [35]. In his seminal paper, Upfal proved that an
$n$-input multibutterfly can route any permutation of $n$ packets from the inputs to the outputs of a multibutterfly
in $O(\log n)$ steps deterministically. (In fact, he showed that even a collection of $\log n$ permutations can be
routed in $O(\log n)$ time.) Because it can sort, the AKS network can also solve these problems in $O(\log n)$
time. In the AKS network, however, the running time of the algorithm cannot be separated from the size
and depth of the network. In the multibutterfly, on the other hand, although the $O(\log N)$ bound on the
running time hides some moderately large constants, the network itself can be constructed by merging just
two copies of the ordinary butterfly network (hence the name multibutterfly). Furthermore, simulations
show that the running time of the routing algorithm is actually smaller than the $O(\log N)$ upper bound
implies [20, 22]. Hence, a case can be made for the practicality of multibutterflies, and several studies have
explored their implementation [9, 10, 13, 14].

Although no deterministic $O(\log n)$-step sorting algorithm for multibutterflies was previously known, the
network was known to have some capabilities that the AKS network was not known to have. For example,
Leighton and Maggs showed that multibutterflies are highly fault tolerant [20]. In particular, they showed
that even if an adversary is permitted to place $f$ worst-case fail-stop faults in a multibutterfly, there is still
some set of $n - O(f)$ inputs and $n - O(f)$ outputs between which any permutation of packets can be routed
in $O(\log n)$ steps. In the fail-stop fault model, a faulty node cannot communicate with its neighbors at all.
Fail-stop faults are more difficult to tolerate than passive faults. Leighton and Maggs also showed that even if
every node in the network fails with some small, but constant, probability, with high probability there is still
some set of $\Theta(n)$ inputs and $\Theta(n)$ outputs between which any permutation can be routed in $O(\log n)$ time.
As Bassalygo and Pinsker showed, the multibutterfly can also be used to construct a nonblocking network.
Arora, Leighton, and Maggs termed two back-to-back multibutterflies a multi-Beneš network, and showed
that not only is a multi-Beneš network nonblocking, but any set of new paths can be established in this
network in $O(\log n)$ steps, even if many requests for new paths are made simultaneously [2]. The algorithms
for reconfiguring a multibutterfly with faults and for establishing disjoint paths were later improved in [15]
and [30], respectively.

1.1  Our  results 

In this paper, we show that multibutterfly networks are at least as powerful as the AKS sorting network. In
particular, we show that an $N$-node AKS network can be embedded in a $(3N/2)$-node degree-8 multibutterfly
with load 1, congestion 1, and dilation 2. As a consequence, an $N$-node multibutterfly can emulate an $N$-node
AKS network with constant slowdown.

The embedding has several immediate implications. The emulation of the AKS network by the
multibutterfly, along with Leighton's columnsort algorithm [19], yields the first deterministic $O(\log N)$-step algorithm
for sorting $N$ elements on an $N$-node multibutterfly. The sorting algorithm can then be used to construct
the first deterministic $O(\log N)$-step algorithms for finding the median of $N$ elements and for routing with
combining on multibutterflies. It also yields a work-efficient deterministic algorithm for finding the median of
$N \log N \log\log N$ elements in $O(\log N \log\log N)$ time on an $N$-node multibutterfly. Because the embedding
of the AKS network into the multibutterfly has constant congestion, bounds on the VLSI layout area and
volume for the multibutterfly translate to the AKS network as well. An $n$-input multibutterfly network can
be laid out in two dimensions with area $O(n^2)$, and in three dimensions with volume $O(n^{3/2})$, and these
bounds are tight. The two-dimensional layout area of the AKS network was known before [6, 7], but the
three-dimensional layout is new.

We also present two deterministic algorithms for solving $h$-to-one routing problems on an $n$-input multibutterfly
in $O(h + \log n)$ time. One applies when $h$ is known, and the other when it is not. Previous routing algorithms
could solve $h$ one-to-one problems in a pipelined fashion [20, 35], but assumed that each packet carried the
label of the one-to-one problem to which it belonged. An algorithm for solving $h$-to-one routing problems
can also be used to route $h$ relations. In an $h$ relation, each source sends at most $h$ packets, and each destination
receives at most $h$ packets. One motivation for designing algorithms that route $h$ relations is that routing
an $h$ relation is the primitive communication step in the BSP model of computation [36], for which there are
growing libraries of parallel programs [16, 27, 26].
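The definition of an $h$ relation can be checked mechanically. The sketch below is only an illustration of the definition, not part of the routing algorithms of this paper; representing packets as `(source, destination)` pairs and the function names are assumptions made for the example.

```python
from collections import Counter


def relation_degree(packets):
    """Smallest h such that the given packets form an h relation.

    packets: iterable of (source, destination) pairs (an assumed encoding).
    """
    packets = list(packets)
    sent = Counter(src for src, _ in packets)       # packets sent per source
    received = Counter(dst for _, dst in packets)   # packets received per destination
    return max(max(sent.values(), default=0),
               max(received.values(), default=0))


def is_h_relation(packets, h):
    """True if every source sends and every destination receives at most h packets."""
    return relation_degree(packets) <= h
```

A one-to-one routing problem (a partial permutation) is exactly the case `h = 1`.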

Finally, we show that a 2-folded butterfly (i.e., a degree-8 multibutterfly), whose individual splitters do
not exhibit expansion, can emulate a bounded-degree multibutterfly with an $(\alpha, \beta)$-expansion property, for
any $\alpha \cdot \beta < 1/4$.

The fact that an $N$-node multibutterfly network contains an $N$-node AKS network does not imply that
the  multibutterfly  is  an  inherently  impractical  network.  Although  the  sorting  algorithm  implied  by  the 
embedding  is  not  practical,  there  is  no  requirement  that  the  multibutterfly  be  used  in  this  fashion.  Indeed, 
independent  of  the  sorting  algorithm,  the  multibutterfly  is  an  efficient  and  highly  fault-tolerant  routing 
network. 

1.2  Other  related  results 

Prior to this work, the fastest deterministic algorithm for sorting $N$ keys on an $N$-node multibutterfly was
the Sharesort algorithm of Cypher and Plaxton [12]. This algorithm was designed to run on the butterfly
network, or on any other hypercubic network. Since the multibutterfly network contains a butterfly network,
it applies to multibutterflies as well (but does not take advantage of the expansion in the multibutterfly).
There are several variants of this algorithm. The fastest uniform version runs in $O(\log N (\log\log N)^2)$ time,
but there is a non-uniform version that runs in $O(\log N \log\log N)$ time. Our embedding result yields an
$O(\log N)$-time algorithm for the multibutterfly. Note that the sorting problem can also be solved on an
$N$-node butterfly (or multibutterfly) in $O(\log N)$ time using the randomized Flashsort algorithm of Reif and
Valiant [21, 34].

Prior to this work, the fastest deterministic selection algorithm for multibutterflies was the algorithm
of Berthomé, Ferreira, Maggs, Perennes, and Plaxton [5]. This algorithm selects the $k$th largest element
from among $N$ elements on an $N$-node butterfly (or any other hypercubic network) in $O(\log N \log^* N)$ time.
Like the Sharesort algorithm, this algorithm does not make use of expansion when run on a multibutterfly.
Since the selection problem can be solved in linear time sequentially [8], this algorithm, which performs
$N \log N \log^* N$ work, is not work efficient. Furthermore, Plaxton [31] showed that any deterministic algorithm
for solving the selection problem on an $N$-node hypercubic network requires $\Omega((M/N) \log\log N + \log N)$
time in the worst case, where $M$ is the number of input elements. This translates to a lower bound of
$\Omega(M \log\log N + N \log N)$ on the work required. Hence, there can be no deterministic work-efficient selection
algorithm on a hypercubic network. Recently, Plaxton showed that for $M/N = \log N$, any deterministic
algorithm for selection on a bounded-degree $N$-node hypercubic network requires $\Omega(\log^{3/2} N)$ steps [32]. He
also presents an algorithm that runs in $O(\log^{3/2} N (\log\log N)^2)$ time on any $N$-node hypercubic network.

For bounded-degree expander-based networks, two optimal deterministic algorithms for selection are
known. For the case of finding the $k$th largest out of $N$ elements on an $N$-node network, the AKS
sorting network combined with columnsort can be used to sort the elements (and hence solve the
selection problem) in $O(\log N)$ time [19]. This algorithm is optimal because selection on any
bounded-degree $N$-node network requires $\Omega(\log N)$ time. The $k$th largest of $M$ elements, $M \ge N$, can be found
in $O((M/N) + \log N \log\log(M/N))$ time on an $N$-node expander-based network using an implementation of
a PRAM algorithm due to Vishkin [37] that invokes the AKS sorting network and columnsort as subroutines
[31]. This algorithm is work-optimal for $M/N \ge \log N \log\log(M/N)$. Our embedding result implies that
a multibutterfly network can perform both of these algorithms. Note that the latter algorithm beats
Plaxton's lower bound for hypercubic networks, thus implying a separation in power between expander-based
networks and hypercubic networks. Rappoport [33] has recently proved an even larger separation, namely
that the largest butterfly that can efficiently simulate an $N$-node multibutterfly has fewer than $N^{\varepsilon}$ nodes, for
all constants $\varepsilon > 0$. For $\omega(1) \le M/N \le o(\log N \log\log(M/N))$ the asymptotic complexity of selection on
bounded-degree networks is currently not known.

1.3  Outline

The remainder of this paper is organized as follows. In Sections 2 and 3 we define the multibutterfly and
AKS networks, respectively. Our embedding of an AKS network into a multibutterfly network is presented in
Section 4. Algorithms for routing $h$-relations on multibutterflies are described in Section 5. In Section 6 we
show that a 2-folded butterfly can simulate a multibutterfly with an $(\alpha, \beta)$-expansion property. We conclude
in Section 7 with some open problems.


2  Multibutterfly  networks 


A $d$-dimensional multibutterfly network (MBF) consists of $d + 1$ levels, each consisting of $2^d$ nodes. Let $(\ell, j)$
denote the $j$th node on level $\ell$. For each level $0 \le \ell \le d$, the nodes on level $\ell$ are partitioned into $2^\ell$ sets

$$A_{\ell,0}, \ldots, A_{\ell,2^\ell-1}.$$

The nodes in $A_{\ell,i}$ are connected to some nodes in the sets $A_{\ell+1,2i}$ and $A_{\ell+1,2i+1}$. The subgraph induced by
the nodes in these three sets is called the splitter of $A_{\ell,i}$. It consists of two concentrators, a left and a right
one. The left concentrator is defined as the subgraph induced by the nodes in $A_{\ell,i}$ and $A_{\ell+1,2i}$, and the right
concentrator is defined as the subgraph induced by the nodes in $A_{\ell,i}$ and $A_{\ell+1,2i+1}$. The nodes on level 0 of
a $d$-dimensional multibutterfly are called input nodes, and the nodes on level $d$ are called output nodes.

All edges of a multibutterfly network are inside its concentrators, i.e., each concentrator is a bipartite
graph $G = (A \cup B, E)$ with $A = A_{\ell,i}$ and $B = A_{\ell+1,2i+1}$ or $B = A_{\ell+1,2i}$ for $0 \le \ell \le d - 1$. The edges in
a concentrator can be chosen in an arbitrary fashion, provided that each node in $A$ has degree $k$ and each
node in $B$ has degree $2k$, for some constant integer $k$. This defines a multibutterfly of degree $4k$.
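The degree constraints can be illustrated with a small generator. This is a minimal sketch under stated assumptions, not a construction from the paper: the sets $A_{\ell,i}$ are taken to be contiguous index ranges of size $2^{d-\ell}$, edges inside each concentrator are drawn by randomly pairing edge "stubs" (so parallel edges may occur), and all function names are hypothetical.

```python
import random
from collections import Counter


def random_concentrator(a_nodes, b_nodes, k, rng):
    """Pair k stubs per A-node with 2k stubs per B-node at random (|A| = 2|B|).

    The result is a multigraph: the same pair may be joined more than once.
    """
    assert len(a_nodes) == 2 * len(b_nodes)
    a_stubs = [a for a in a_nodes for _ in range(k)]
    b_stubs = [b for b in b_nodes for _ in range(2 * k)]
    rng.shuffle(b_stubs)
    return list(zip(a_stubs, b_stubs))


def random_multibutterfly(d, k, seed=0):
    """Edge list of a d-dimensional multibutterfly of degree 4k; nodes are (level, index)."""
    rng = random.Random(seed)
    edges = []
    for level in range(d):
        size = 2 ** (d - level)                      # |A_{level,i}| (assumed layout)
        for i in range(2 ** level):
            a = [(level, i * size + j) for j in range(size)]
            left = [(level + 1, i * size + j) for j in range(size // 2)]
            right = [(level + 1, i * size + size // 2 + j) for j in range(size // 2)]
            edges += random_concentrator(a, left, k, rng)    # left concentrator
            edges += random_concentrator(a, right, k, rng)   # right concentrator
    return edges


def degrees(edges):
    """Number of edge endpoints at every node."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg
```

Input and output nodes end up with degree $2k$, and every interior node with degree $4k$. A random choice like this does not by itself certify any expansion property.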

The multibutterfly structure is very similar to that of the butterfly network, i.e., the butterfly network
is a special variant of the multibutterfly of degree 4. The basic advantage of the multibutterfly compared
to the butterfly is that the multibutterfly may satisfy some expansion properties if the edges inside the
concentrators are chosen properly. Let $\Gamma(X)$, for a subset of nodes $X$, denote the set of the neighbors of
the nodes in $X$. Then we say a concentrator $G = (A \cup B, E)$ has $(\alpha, \beta)$-expansion if for any set $X \subseteq A$
with $|X| \le \alpha|A|$, we have $|\Gamma(X) \cap B| \ge \beta|X|$. A multibutterfly is said to have $(\alpha, \beta)$-expansion if all its
concentrators have $(\alpha, \beta)$-expansion. Upfal [35] shows that for any $d$, $k$, $\alpha$, and $\beta$ with $2\beta < k - 1$ and
$\alpha$ sufficiently small, there exists a multibutterfly of degree $4k$ with $(\alpha, \beta)$-expansion.
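For very small concentrators the expansion property can be verified by exhaustive search. The sketch below assumes the reading of the definition in which the size bound is $|X| \le \alpha|A|$; it takes exponential time and is meant only to make the definition concrete. The function name and the edge-list encoding are illustrative assumptions.

```python
from itertools import combinations


def has_expansion(a_nodes, b_nodes, edges, alpha, beta):
    """Brute-force test of (alpha, beta)-expansion for one concentrator.

    Checks that every nonempty X within A with |X| <= alpha*|A| satisfies
    |Gamma(X) within B| >= beta*|X|. edges is a list of (a, b) pairs.
    """
    b_set = set(b_nodes)
    nbrs = {a: set() for a in a_nodes}
    for a, b in edges:
        if a in nbrs and b in b_set:
            nbrs[a].add(b)
    max_size = int(alpha * len(a_nodes))
    for size in range(1, max_size + 1):
        for subset in combinations(a_nodes, size):
            gamma = set().union(*(nbrs[a] for a in subset))
            if len(gamma) < beta * size:
                return False        # found a set that does not expand enough
    return True
```

In a butterfly splitter, pairs of inputs share their single neighbor in each concentrator, which is why such a check fails there for any $\beta > 1/2$ once sets of size two are allowed.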

Finally, we define a subclass of the multibutterfly networks which includes those multibutterflies that
can be constructed by superimposing butterfly networks. Suppose the edges of a multibutterfly of dimension
$d$ can be colored by $k$ colors such that the network induced by the edges of each color is isomorphic to
the butterfly of dimension $d$. Then this multibutterfly is called a $k$-folded butterfly since it can be constructed
by folding $k$ butterfly networks. The labels of the nodes in the $A_{\ell,i}$ sets of each of these butterflies are
permuted and the $k$ nodes of distinct butterflies are merged together to form each multibutterfly node. The
$k$ butterfly networks that define a multibutterfly are called underlying butterflies and we denote them by
$BF_1, \ldots, BF_k$.

3  The  AKS  network 

Our  description  of  the  AKS  network  is  based  on  Paterson’s  description  [28].  Ours  is  a  little  more  general 
than  Paterson’s  because  we  do  not  describe  the  building  blocks,  i.e.,  the  separators,  and  the  sorters  in  detail. 

The AKS network is a sorting network that consists of $h \cdot T$ levels that are partitioned into $T$ stages of
width $n$ and of constant height $h$. Let

$$V_t := \{(i, h \cdot t + j) \mid 0 \le i \le n - 1,\ 0 \le j \le h - 1\}$$

be the set of nodes on stage $t$, for $0 \le t \le T - 1$. Then each node $(i, j)$ is connected via a forward edge to
node $(i, j + 1)$, for $0 \le i \le n - 1$ and $0 \le j < h \cdot T - 1$. In addition to the forward edges, the network
contains compare-exchange edges which connect nodes on the same level, i.e., each compare-exchange edge connects a
node $(i, j)$ with a node $(i', j)$, for $0 \le i < i' \le n - 1$ and $0 \le j \le h \cdot T - 1$. Each node is incident to at most
one compare-exchange edge.

The AKS network sorts $n$ elements in $2 \cdot h \cdot T - 1 = O(T)$ steps. At the beginning of step 0, the elements
are placed at the nodes on level 0. In each even step, the two elements located at the endpoints of each
compare-exchange edge are compared, and the elements are exchanged if they are in the wrong order. In
each odd step, the elements are moved along the forward edges to the next higher level. After step $2 \cdot h \cdot T - 2$,
the elements are placed in sorted order on the nodes of level $h \cdot T - 1$.
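The alternation of compare-exchange steps and forward steps can be sketched as follows. The comparator schedule used here is odd-even transposition sort, chosen only because it is short; it is a stand-in and NOT the AKS construction. The list-of-levels encoding and both function names are assumptions made for the illustration.

```python
def run_network(keys, levels):
    """Run a leveled comparator network and count steps.

    levels[j] lists the compare-exchange edges (i, i2), i < i2, on level j;
    each position appears in at most one edge per level.
    Returns (values on the last level, number of steps used).
    """
    values = list(keys)
    steps = 0
    for j, edges in enumerate(levels):
        # even step: compare along each edge, exchange if in the wrong order
        for i, i2 in edges:
            if values[i] > values[i2]:
                values[i], values[i2] = values[i2], values[i]
        steps += 1
        if j < len(levels) - 1:
            # odd step: every element moves along its forward edge to level j + 1
            steps += 1
    return values, steps


def odd_even_transposition(n):
    """Stand-in schedule with n levels (not AKS): alternate even and odd neighbor pairs."""
    return [[(i, i + 1) for i in range(j % 2, n - 1, 2)] for j in range(n)]
```

With $L$ levels the simulation uses $2L - 1$ steps, matching the count $2 \cdot h \cdot T - 1$ when $L = h \cdot T$.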

Each stage of the AKS network consists of several independent building blocks. All of the
compare-exchange edges are inside these building blocks. We initially describe the widths of these blocks as if they
were real numbers. Ultimately, we will replace these ideal values by appropriate integers. Most of the building
blocks are separators, some are sorters, and some are forward blocks. We give a brief overview of these blocks
without going into details. Each separator of width $m$ returns a partition of its $m$ input elements into four
parts, FL (far-left), CL (center-left), CR (center-right), FR (far-right). We do not describe the structure of
the separators but we are interested in the sizes of the four partitions. The combined size of FL and FR is $\lambda \cdot m$ and
the combined size of CL and CR is $(1 - \lambda) \cdot m$, e.g., $\lambda = 1/8$. The sorters return the $m$ input elements in sorted order.
It is convenient to implement the sorters as Batcher's bitonic sorting network [4]. All sorters have constant
width, so they can be implemented in constant height $h$. The forward blocks include only forward edges and
no compare-exchange edges.

In  the  following,  we  describe  the  widths  of  the  building  blocks  and  which  output  partitions  of  the  blocks 
in stage $t - 1$ are connected to which input partitions of the blocks in stage $t$, for $1 \le t \le T - 1$. Our
description is based on an oblivious sorting algorithm structured about a complete binary tree of depth $\log n$
which we imagine with the root at the top (on level 0) and leaves below (on level $\log n$). The algorithm
works  in  T  stages  that  are  equivalent  to  the  stages  of  the  AKS  network. 


Consider  a  binary  tree  B  with  “bags”  at  each  node.  Initially,  the  set  of  n  elements  to  be  sorted  is 
contained  in  the  single  bag  at  the  root.  Suppose  each  node  of  the  tree  partitions  the  elements  that  it  gets 
from  its  parent  into  two  halves  and  sends  the  smaller  half  to  the  left  child  and  the  larger  half  to  the  right 
child.  Then  the  elements  will  arrive  in  sorted  order  at  the  leaves  of  the  tree.  Unfortunately,  it  is  not  possible 
to  split  the  elements  exactly  into  the  two  halves  at  each  node  in  constant  time.  The  strategy  of  the  AKS 
algorithm  is  to  make  an  approximate  partition  of  elements,  which  can  be  done  by  the  separators.  The 
elements that are sent to the wrong child are then retransmitted in later stages.

We  will  not  describe  the  algorithm  in  detail.  Instead,  we  consider  the  flow  of  the  elements  between  the 
bags.  The  proof  that  the  algorithm  sorts  can  be  found  in  Paterson’s  article  [28].  Associated  with  each  node 
of  the  tree  is  a  bag  that  contains  a  number  of  elements.  The  size  of  a  bag  is  the  number  of  elements  stored 
in  the  bag,  and  the  capacity  of  a  bag  is  the  maximum  number  of  elements  that  can  be  stored  in  that  bag. 
During most stages, a bag is either empty or filled to its capacity. The capacity of each bag at level $\ell$ is $s \cdot A^\ell$,
for some constant $A$, e.g., $A = 3$, and some value of $s$ that decreases with time.

Special situations occur at the highest and lowest nonempty levels of the tree, so we start with a
description of the sorting process at intermediate levels. The algorithm works in $T$ stages beginning with stage 0.
Each stage $t$ is implemented in stage $t$ of the AKS network. At odd stages, (some) odd levels are full and
all the bags at the even levels are empty. The opposite holds at even stages. At each stage the elements in
any  full  bag  are  partitioned  by  a  separator  into  the  four  partitions  FL,  CL,  CR,  and  FR.  The  FL  and  the 
FR  parts  are  sent  back  to  the  parent  bag  and  the  CL  and  CR  parts  are  transferred  down  to  the  left  and 
right  child  bags,  respectively. 


Figure 1: Reduction of bag capacities after each stage (panels: Stage $t$ and Stage $t + 1$).


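Consider a bag with capacity $b$ that is empty at the beginning of some stage and which is filled to its
new capacity $\nu b$ at the end of the stage, as shown in Figure 1. The bag receives the FL and FR parts of its
two children, each of capacity $bA$, and one of the two center parts of its parent, which has capacity $b/A$. Then

$$\nu b = 2\lambda b A + \frac{(1 - \lambda)\, b}{2A},$$

which gives

$$\nu = 2\lambda A + \frac{1 - \lambda}{2A}.$$

We assume that $\nu < 1$, e.g., $\nu = 43/48$. Thus, the capacities diminish at each stage and keys are squeezed
down the tree in the course of the algorithm. We define the capacity of each bag at level $\ell$ at the beginning
of stage $t$ to be

$$c_\ell(t) := \left(1 - \frac{1}{4A^2}\right) \cdot n\, \nu^t A^\ell.$$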
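The example values quoted in the text can be checked with exact arithmetic. The sketch below assumes the balance $\nu = 2\lambda A + (1 - \lambda)/(2A)$, which reproduces $\nu = 43/48$ from $\lambda = 1/8$ and $A = 3$, and the capacity $c_\ell(t) = (1 - 1/(4A^2)) \cdot n \nu^t A^\ell$, which matches the stated root content at $t = 0$. The function names are illustrative.

```python
from fractions import Fraction


def shrink_factor(lam, A):
    """nu = 2*lam*A + (1 - lam)/(2*A).

    First term: FL and FR parts sent up by the two children (capacity b*A each).
    Second term: one center part sent down by the parent (capacity b/A).
    """
    return 2 * lam * A + (1 - lam) / (2 * A)


def bag_capacity(n, nu, A, level, t):
    """Ideal (real-valued) capacity of a bag on the given level at the start of stage t."""
    return (1 - Fraction(1, 4 * A * A)) * n * nu ** t * Fraction(A) ** level


nu = shrink_factor(Fraction(1, 8), 3)     # example parameters from the text
root_at_start = bag_capacity(36, nu, 3, level=0, t=0)
```

Here `nu` evaluates to `Fraction(43, 48)`, and for the toy width `n = 36` the root starts with capacity 35, leaving 1 element for the cold storage.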

At the beginning of the algorithm all bags except for the root are empty. The root is filled to its capacity,
i.e., it contains $(1 - 1/(4A^2)) \cdot n$ keys. Since we would like the root to behave as if it were an ordinary node,
we place above it a subset of the elements of size $1/(4A^2) \cdot n$. We call this subset the cold storage. The root
exchanges keys with the cold storage as with a parent. The cold storage simulates half the root's parent,
one-fourth the root's grandparent, and so on. The capacity of the cold storage in step $t$ is therefore

$$\frac{1}{4} c_{-2}(t) + \frac{1}{16} c_{-4}(t) + \cdots = \frac{n \cdot \nu^t}{4A^2}$$

if $t$ is even, and

$$\frac{1}{2} c_{-1}(t) + \frac{1}{8} c_{-3}(t) + \cdots = \frac{n \cdot \nu^t}{2A}$$

if $t$ is odd. The cold storage is the simplest structure in the AKS network. It is implemented as a forward
block.
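The cold storage capacity can be cross-checked numerically by summing the geometric series term by term. This sketch assumes that the ancestor at distance $j$ above the root contributes a fraction $2^{-j}$ of a bag of capacity $(1 - 1/(4A^2)) \cdot n \nu^t A^{-j}$, and that only ancestors that are full at stage $t$ are counted (even distances for even $t$, odd distances for odd $t$). The function name and the truncation at `terms` summands are choices made for the example.

```python
from fractions import Fraction


def cold_storage_capacity(n, nu, A, t, terms=40):
    """Truncated series for the cold storage capacity at stage t (exact rationals)."""
    def capacity(level):
        # ideal bag capacity on a (virtual) level above the root, level < 0
        return (1 - Fraction(1, 4 * A * A)) * n * nu ** t * Fraction(A) ** level

    first = 2 if t % 2 == 0 else 1       # nearest ancestor that is full at stage t
    total = Fraction(0)
    for j in range(terms):
        distance = first + 2 * j
        total += Fraction(1, 2 ** distance) * capacity(-distance)
    return total


nu = Fraction(43, 48)                     # example value from the text
even_case = cold_storage_capacity(36, nu, 3, t=0)
odd_case = cold_storage_capacity(36, nu, 3, t=1)
```

For $n = 36$ and $A = 3$ the truncated sums agree, up to a negligible remainder, with $n\nu^t/(4A^2) = 1$ at $t = 0$ and $n\nu^t/(2A) = 43/8$ at $t = 1$.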

During  the  course  of  the  algorithm  the  elements  migrate  down  through  the  tree.  We  will  arrange  that 
there  is  at  most  one  partially  filled  level.  Above  this,  the  levels  are  alternately  empty  and  full  as  already 
described;  below  all  the  levels  are  empty.  To  achieve  this,  we  require  that  at  the  partial  level  each  bag  should 
send  up  to  its  parent  the  normal  number  of  elements  if  it  has  sufficiently  many.  After  this  requirement  is 
met,  any  remaining  elements  can  be  sent  down  to  its  children  in  equal  numbers. 

In  the  final  stages  some  of  the  separators  are  replaced  by  sorters  and  forward  blocks.  In  particular,  if 
the  capacity  of  the  root  bag  is  smaller  than  r,  for  some  constant  r,  e.g.  r  =  160,  then  the  set  of  elements  in 
the  root  bag  and  the  cold  storage  is  sorted  and  separated  into  a  left  and  right  half.  From  these  halves  the 
root  and  the  cold  storage  for  each  subtree  can  be  immediately  formed.  After  the  first  splitting  step,  a  new 
splitting  step  will  be  required  at  regular  bounded  intervals,  i.e.,  whenever  the  capacity  of  a  bag  becomes 
smaller  than  r,  the  separator  is  replaced  by  a  sorter  and  the  elements  are  split  into  two  halves.  The  algorithm 
finishes after stage $T - 1$, in which the elements of the bags on some level are sorted and all bags below this
level  are  empty. 

The  widths  of  the  building  blocks  in  the  AKS  network  can  be  extracted  from  the  above  description.  All 
sizes  are  specified  as  real  numbers.  Paterson  gives  a  simple  recipe  for  replacing  the  real  numbers  by  integers 
without  straying  far  from  the  ideal  values.  For  each  subtree  rooted  at  a  nonempty  node,  if  the  ideal  total 
size of the subtree is $a$, then the actual size is $2\lceil a/2 \rceil$.

4  Embedding  the  AKS  network  into  a  multibutterfly 

In this section, we embed an AKS network into a multibutterfly network. We denote the width of the AKS
network by $n$, the number of stages by $T$, and the height of the stages by $h$. We assume that the widths of
the building blocks are defined by the parameters $A$, $\lambda$, $\nu$, and $r$ as described in Section 3. We prove the
following result.

Theorem 4.1 An AKS network of size $N$ can be embedded into a 2-folded butterfly of size $M \le k \cdot N + o(N)$
with dilation 2 and congestion 1, where $k$ is a small constant depending on the AKS parameters $\nu$, $A$, $r$, and
$h$.


Suppose  that  the  AKS  parameters  are  chosen  according  to  Paterson’s  recommendation,  which  should 
minimize the size of the AKS network, i.e., $\nu = 43/48$, $A = 3$, $r = 160$, and $h = 36$. Then $k$ is smaller than
1.5.  In  the  following,  we  describe  the  embedding  and  prove  the  result  on  the  relationship  of  the  network 
sizes. 

Rough embedding. The description of the AKS network is structured about a binary tree. The nodes of
this tree represent bags whose sizes vary from stage to stage, i.e., over time. Instead of looking at one binary
tree $B$ with growing and shrinking bag sizes, we can imagine that we have $T$ trees $B_0, \ldots, B_{T-1}$ of
fixed-size bags, such that the bags in the $t$th tree represent the building blocks of the $t$th
stage. In particular, each bag of tree $B_t$ with size $s$ is realized as a building block of width $s$ and height $h$
in stage $t$.

A natural partition of the AKS building blocks is to divide the blocks according to their stages. Then
each partition corresponds to one of the $T$ trees. In fact, this partition is the one implemented in the AKS
network. For the embedding into the MBF, we divide the blocks of the AKS network according to the
tree levels into partitions $P_0, \ldots, P_{\log n}$. That means that partition $P_\ell$ includes all $2^\ell \cdot T$ building blocks that
are associated with a node on the $\ell$th tree level in one of the $T$ trees. In addition, we add the forward blocks
of the cold storage to partition $P_0$. Define the size of a partition $P_\ell$ to be the sum of the sizes of all bags
on the respective tree level $\ell$. This size is denoted by $|P_\ell|$. Note that some bags in each partition have size
0, e.g., all bags below the partial level.


The AKS building blocks associated with the bags of partition $P_\ell$ are embedded in the $\ell$th level of the
MBF. Of course, we have to define more precisely which nodes in the building blocks in partition $P_\ell$ are
mapped onto which nodes of the MBF in level $\ell$. Divide each partition $P_\ell$ into equal-sized subpartitions
$P_{\ell,0}, \ldots, P_{\ell,2^\ell-1}$ such that subpartition $P_{\ell,i}$ includes all blocks that correspond to the $i$th node on level $\ell$ of
the AKS tree $B$. Then for $0 \le i \le 2^\ell - 1$ and for $0 \le \ell \le \log n - 1$, subpartition $P_{\ell,i}$ includes all parent
bags of the bags in the subpartitions $P_{\ell+1,2i}$ and $P_{\ell+1,2i+1}$. Embed the AKS nodes of partition $P_{\ell,i}$ into
the MBF nodes of the set $A_{\ell,i}$. Of course, in order to get an embedding with load 1, it is required that
$|A_{\ell,i}| \ge h \cdot |P_{\ell,i}|$, which is the number of nodes represented by the partition $P_{\ell,i}$. It will be seen later that
the size of $A_{\ell,i}$ has to be a little bit larger than this value.

Now suppose we add all AKS edges to the MBF regardless of the multibutterfly structure, i.e., we
connect each pair of MBF nodes representing a pair of adjacent AKS nodes by an edge. Then each AKS
edge that connects a node of a parent bag in subpartition $P_{\ell,i}$ to a node of a child bag in subpartition $P_{\ell+1,2i}$
or $P_{\ell+1,2i+1}$ is represented by an edge inside the multibutterfly splitter containing the sets $A_{\ell,i}$, $A_{\ell+1,2i}$,
and $A_{\ell+1,2i+1}$. In addition, the AKS edges inside the building blocks, and thus inside the subpartitions,
are represented by edges inside the $A_{\ell,i}$ sets. Thus, we can restrict ourselves to describing the
embedding inside the splitters.

Fine embedding. Consider a splitter consisting of the sets $A := A_{\ell,i}$, $L := A_{\ell+1,2i}$ and $R := A_{\ell+1,2i+1}$.
Define $m := |A|$. The edges between $A$, $L$, and $R$ are defined by two butterfly networks $BF_1$ and $BF_2$ that
are folded together to a multibutterfly. We assume that the embedding is done for the levels $\log n$ down to $\ell + 1$.
That means that the foldings of $BF_1$ and $BF_2$ are fixed up to level $\ell + 1$. We have to describe the mapping of
the AKS nodes in subpartition $P := P_{\ell,i}$ onto the MBF nodes in $A$.

First we embed the AKS nodes so that each compare-exchange edge of the AKS network can be simulated
by two edges of $BF_1$. Suppose the nodes in $A$ are labeled $(\ell, 0), (\ell, 1), \ldots, (\ell, m-1)$, the nodes in $L$ are
labeled $(\ell+1, 0), \ldots, (\ell+1, m/2-1)$, and the nodes in $R$ are labeled $(\ell+1, m/2), \ldots, (\ell+1, m-1)$, so that each
node $(\ell, v) \in A$ is connected by a $BF_1$ edge to the nodes $(\ell+1, v)$ and $(\ell+1, v + m/2 \pmod{m})$ from $L \cup R$.
Then each node in $A$ is connected by a left edge to a node in $L$ and by a right edge to a node in $R$.

We embed each pair of AKS nodes $u$ and $v$ of $P$ that are connected by a compare-exchange edge to two
nodes $(\ell, u')$ and $(\ell, v')$ of $A$, respectively, so that $v' - u' = m/2$. Then the AKS edge between $u$ and $v$ can
be simulated by a path of length 2. The path is

$$(\ell, u') \to (\ell+1, u') \to (\ell, v').$$

Note that the path uses only left edges of $BF_1$.
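The labeling of the splitter can be checked mechanically. The following Python sketch assumes the $BF_1$ wiring described above (column $v$ of $A$ is adjacent to columns $v$ and $(v + m/2) \bmod m$ on level $\ell + 1$, and columns below $m/2$ form $L$). The function names are illustrative and do not come from the paper.

```python
def bf1_neighbors(v, m):
    """Columns on level l+1 adjacent to column v of A under the assumed BF_1 wiring."""
    return {v, (v + m // 2) % m}


def is_left(col, m):
    """Columns 0 .. m/2 - 1 on level l+1 form L; the remaining columns form R."""
    return col < m // 2


def compare_exchange_path(u, m):
    """Path of length 2 between columns u and u + m/2 of A through a node of L.

    Requires 0 <= u < m/2. Returns (u, middle, v), where middle is the
    unique column of L adjacent to both endpoints.
    """
    v = u + m // 2
    common = bf1_neighbors(u, m) & bf1_neighbors(v, m)
    left = [c for c in common if is_left(c, m)]
    assert len(left) == 1  # exactly one shared neighbor lies in L
    return (u, left[0], v)
```

For a splitter with $m = 8$, `compare_exchange_path(1, 8)` yields the columns of the path between $(\ell, 1)$ and $(\ell, 5)$ through $(\ell+1, 1)$.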

Now we embed the forward edges inside the building blocks. Until now we have not used the freedom to
determine the folding of the two butterflies, i.e., we have not fixed the edges in $BF_2$. Suppose we connect
each node of the set $R$ to two nodes in $A$ such that each node in $A$ is adjacent to one node in $R$. Then these
edges are admissible choices for the edges in $BF_2$. We use this fact to realize the forward edges. Consider an
AKS node $u \in P$. Suppose $u$ is a node in row $t$ of the AKS and $u$ is connected by a forward edge to a node
$v$ in row $t + 1$. Let $u$ be embedded in node $(\ell, u')$ and $v$ in node $(\ell, v')$ of $A$. We simulate the edge between
$u$ and $v$ by a path of length 2 between $(\ell, u')$ and $(\ell, v')$. This path consists of a right $BF_1$ edge and a right
$BF_2$ edge. The path is

$$(\ell, u') \xrightarrow{BF_1} (\ell+1, u') \xrightarrow{BF_2} (\ell, v').$$

Note that this plugs in an admissible $BF_2$ edge between $(\ell+1, u')$ and $(\ell, v')$.

Next we embed the forward edges between distinct building blocks, i.e., the edges between adjacent AKS
nodes embedded in level $\ell$ and level $\ell + 1$. Let $C_L \subseteq L$ and $C_R \subseteq R$ be two sets of nodes that are arranged
symmetrically, i.e., $C_R = \{(\ell+1, v + m/2) \mid (\ell+1, v) \in C_L\}$. Define $m' := |C_L \cup C_R|$. Furthermore, let
$C \subseteq A$ be a set of nodes on level $\ell$ of size $m'$. Suppose none of the $BF_2$ edges that we have plugged in
until now is incident to a node in $C_L$, $C_R$, or $C$, and suppose we plug in an arbitrary matching of $BF_2$
edges between $C_L \cup C_R$ and $C$. Then these edges are admissible $BF_2$ edges. Now define $C_L$ to be the set of
nodes in $L$ that should be connected to nodes in $A$, and define $C_R$ to be the set of nodes in $R$ that should
be connected to nodes in $A$. We assume that $C_L$ and $C_R$ are arranged symmetrically. This can be done
because the embedding into the two submultibutterflies below $L$ and $R$ can be assumed to be isomorphic.
Unfortunately, we have already fixed the $BF_2$ edges incident to the nodes in $C_R$ for embedding the forward
edges inside the building blocks. Therefore, we have to modify the above embedding slightly. Define $\hat{C} \subseteq A$
to be the set of $m'$ nodes above the nodes in $C_L$ and $C_R$, i.e., $\hat{C} := \{(\ell, v) \mid (\ell+1, v) \in C_L \cup C_R\}$. We change
the above embedding so that no AKS node is mapped onto the nodes in $\hat{C}$. This has a nice consequence:
the $BF_2$ edges incident to $C_L$ and $C_R$ are not used for implementing the forward edges inside the building
blocks. Finally, define $C \subseteq A$ to be the set of nodes that must be connected to the nodes in $C_L$ and $C_R$.
Then we can simulate these edges by an appropriate matching of $BF_2$ edges between $C_L \cup C_R$ and $C$, which
completes the description of the embedding from $P$ into $A$. The load of our embedding is 1, the dilation is
2, and the congestion is 1 since no multibutterfly edge is used for simulating more than one AKS edge.

In order to implement the above embedding we have to assume that the size of $A$ is not too small, or the
other way round, that the size of partition $P$ is not too big, i.e., the inequality $h \cdot |P| + |C| \le |A|$ must be
satisfied. Note that $|C| \le 2 \cdot |P|$, $|P| \le |P_\ell|/2^\ell$, and $|A| = m/2^\ell$, with $m$ denoting the number of nodes on a
multibutterfly level. Thus, the above description can be implemented if

$$(h + 2) \cdot |P_\ell| \le m \qquad (1)$$

for every tree level $\ell$ of the AKS network. This defines a constraint on the relationship between the size
of the AKS network and the multibutterfly. In order to investigate this constraint, we first calculate some
properties of the AKS network.

Properties of the AKS network. Define the capacity $C_\ell(t)$ of a level $\ell$ in the AKS tree to be the sum
of the capacities of all bags on this tree level. Then

$$C_\ell(t) = 2^\ell \cdot c_\ell(t) = \left(1 - \frac{1}{4A^2}\right) \cdot n \cdot \nu^t \cdot (2A)^\ell.$$

Note that the cold storage simulates a bag half the size of the root (the root's parent), one quarter the size
of the root (the root's grandparent), and so on. Thus, we can imagine the cold storage as partitioned into
an infinite number of virtual levels -1, -2, -3, and so on, such that the above equation for $C_\ell(t)$ holds for
any integer $-\infty < \ell \le \log n$ and $t \ge 0$.

In the following, we say two tree levels $\ell$ and $\ell'$ are congruent if $\ell \equiv \ell' \pmod{2}$. Analogously, we say a
tree level $\ell$ and a stage $t$ are congruent if $\ell \equiv t \pmod{2}$. For short we write $\ell \equiv \ell'$ or $\ell \equiv t$, respectively.
Further, we say tree level $\ell$ is above tree level $\ell'$ if $\ell < \ell'$, and $\ell$ is below $\ell'$ if $\ell > \ell'$. In each stage $t$, each tree
level $\ell$ above the partial level is filled to its capacity if $\ell \equiv t$, and is empty if $\ell \not\equiv t$. All tree levels below the
partial level are empty.

For a stage $t$, define $A_\ell(t)$ to be the sum of the capacities of all tree levels above level $\ell$ and congruent to
$t$. Then

$$A_\ell(t) = C_{\ell-2}(t) + C_{\ell-4}(t) + C_{\ell-6}(t) + \cdots = \left(1 - \frac{1}{4A^2}\right) \cdot n \cdot \nu^t \cdot \sum_{i=1}^{\infty} (2A)^{\ell-2i} = n \cdot \nu^t \cdot (2A)^{\ell-2}$$

if $\ell \equiv t$, and

$$A_\ell(t) = C_{\ell-1}(t) + C_{\ell-3}(t) + C_{\ell-5}(t) + \cdots = \left(1 - \frac{1}{4A^2}\right) \cdot n \cdot \nu^t \cdot \sum_{i=0}^{\infty} (2A)^{\ell-1-2i} = n \cdot \nu^t \cdot (2A)^{\ell-1}$$


if $\ell \not\equiv t$. Consider the partial tree level in stage $t$. The number of elements in the bags
of the tree levels above $\ell$ is $A_\ell(t)$ if $\ell$ is not below the partial level. Of course, this sum can be at most $n$. As a consequence, if
$A_\ell(t) > n$, then tree level $\ell$ is below the partial level. Define

$$t_\ell^0 := (\ell - 2) \cdot \log_{1/\nu}(2A).$$

Then tree level $\ell$ is below the partial level, and hence empty, in every stage $t < t_\ell^0$ since

$$A_\ell(t) > n \cdot \nu^{t_\ell^0} \cdot (2A)^{\ell-2} = n.$$
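The closed forms above can be cross-checked numerically. The following Python sketch assumes the capacity formula $C_\ell(t) = (1 - 1/(4A^2)) \cdot n \cdot \nu^t \cdot (2A)^\ell$ and the threshold $(\ell - 2) \cdot \log_{1/\nu}(2A)$ as read from the text. The function names and the truncation length `terms` are illustrative choices, not part of the paper.

```python
import math
from fractions import Fraction


def level_capacity(l, t, n, nu, A):
    """C_l(t) = (1 - 1/(4A^2)) * n * nu^t * (2A)^l; l may be negative (cold storage)."""
    return (1 - Fraction(1, 4 * A * A)) * n * nu**t * Fraction(2 * A) ** l


def capacity_above(l, t, n, nu, A, terms=60):
    """Truncated sum of C_j(t) over levels j < l that are congruent to t (mod 2)."""
    first = l - 2 if (l - t) % 2 == 0 else l - 1
    return sum(level_capacity(first - 2 * i, t, n, nu, A) for i in range(terms))


def capacity_above_closed(l, t, n, nu, A):
    """Closed form: n * nu^t * (2A)^(l-2) if l = t (mod 2), else n * nu^t * (2A)^(l-1)."""
    exponent = l - 2 if (l - t) % 2 == 0 else l - 1
    return n * nu**t * Fraction(2 * A) ** exponent


def empty_threshold(l, nu, A):
    """Assumed threshold (l - 2) * log_{1/nu}(2A) below which level l is empty."""
    return (l - 2) * math.log(2 * A) / math.log(1 / float(nu))
```

Exact rational arithmetic is used for the sums so that the comparison with the closed form is not blurred by rounding; only the threshold, which involves a logarithm, is computed in floating point.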


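Define

$$t_\ell^1 := \ell \cdot \log_{1/\nu}(2A).$$

Then tree level $\ell$ is filled to its capacity in every congruent stage $t \ge t_\ell^1$ because

$$A_\ell(t) + C_\ell(t) \le \left(1 - \frac{1}{4A^2}\right) \cdot n \cdot \nu^{t_\ell^1} \cdot (2A)^\ell + n \cdot \nu^{t_\ell^1} \cdot (2A)^{\ell-2} = n \cdot \nu^{t_\ell^1} \cdot (2A)^\ell \cdot \left(\left(1 - \frac{1}{4A^2}\right) + (2A)^{-2}\right) = n,$$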

which means that $\ell$ is above the partial level. Further, define $t_\ell^2$ such that $C_\ell(t_\ell^2) = r$.
Then the splitting step of tree level $\ell$ is in the first congruent stage $t \ge t_\ell^2$. This is because

$$C_\ell(t_\ell^2) = \left(1 - \frac{1}{4A^2}\right) \cdot n \cdot \nu^{t_\ell^2} \cdot (2A)^\ell = r,$$

which means that $C_\ell(t) > r$ for $t < t_\ell^2$, and $C_\ell(t) \le r$ for $t \ge t_\ell^2$. Finally, we show that the AKS algorithm
finishes with the splitting step in level $\ell^*$ in stage $T$ such that $T$ is the smallest integer congruent to $\ell^*$ that satisfies

$$T \ge t_{\ell^*}^1 = \ell^* \cdot \log_{1/\nu}(2A) = \log_2 n \cdot \log_{1/\nu}(2A) - O(1).$$

This  is  because  T  j  and  the  number  of  elements  stored  in  bags  of  levels  below  £*  in  stage  T  is  0  (since 
Ci{T)  -{-  Ai{T)  >  n)y  and  because  £*  is  the  smallest  level  that  satisfies  these  two  conditions. 

The size of the AKS network. In this section, we calculate an upper bound on the size of the AKS
network that can be embedded into a $(\log_2 m)$-dimensional multibutterfly with $m$ nodes on each level according to
the above description. That means we are looking for the smallest AKS network that fulfills Equation 1, i.e.,
$(h + 2) \cdot |P_\ell| \le m$, for every level $\ell$ of the AKS tree. We have to bound the size of each partition $P_\ell$. We
first assume ideal batch sizes and show later that the results for these values are close to the results for the
correct integer values.

A special situation occurs for partition $P_0$. This partition includes the root bags and the cold storage.
The size of the root bag in an even stage $t$ is $C_0(t)$, and in odd stages the size is 0. The size of the cold
storage in a stage $t$ is $A_0(t)$. Hence, we have

$$|P_0| \le \sum_{t=0}^{\infty} C_0(2t) + \sum_{t=0}^{\infty} A_0(t) =: K_1(\nu, A) \cdot n.$$

Now we bound the size of partition $P_\ell$, for $1 \le \ell \le \ell^*$. We first ignore the effects of the splitting step,
i.e., we assume that $r = 0$. Then the size of $P_\ell$ can be bounded as follows.

• In each stage $t \equiv \ell$ with $t_\ell^0 \le t < t_\ell^1$, the size of the tree level is at most $n - A_\ell(t)$.

• In each stage $t \equiv \ell$ with $t \ge t_\ell^1$, the size of the tree level is at most $C_\ell(t)$.

• In all other stages the size is 0.

Thus, for $1 \le \ell \le \ell^*$ we have

$$|P_\ell| \le \sum_{t \equiv \ell,\ t_\ell^0 \le t < t_\ell^1} \bigl(n - A_\ell(t)\bigr) + \sum_{t \equiv \ell,\ t \ge t_\ell^1} C_\ell(t) \le K_2(\nu, A) \cdot n$$

under the assumption that $r = 0$. Now we assume $r > 0$. That means that the size of tree level $\ell$ is increased
by $A_\ell(t)$ in each stage $t$ from the splitting stage of level $\ell$ to the splitting stage of level $\ell + 1$. Therefore the
above bound is increased by at most

$$\sum_{t} A_\ell(t) \le K_3(\nu, A) \cdot n,$$

where the sum ranges over these stages.


Up  to  now  we  have  assumed  that  all  bags  have  ideal  sizes  as  real  numbers.  But  the  integer  sizes  can 
be  bigger  than  the  ideal  ones.  Fortunately,  both  values  differ  only  slightly,  i.e.,  each  bag  of  ideal  size  b  has 
integer  size  at  most  6  +  2  [28].  Consider  a  level  £.  The  number  of  congruent  stages  between  stage  and  t[  is 
at  most  log^(i//2+)/2,  and  the  number  of  batches  on  level  £  in  a  stage  is  at  most  2^  >  2^*  >  {l-l/AA"^) -n/r. 
In  each  congruent  stage  ti  <t  <  +  1,  all  bags  on  this  level  have  ideal  size  at  least  i/r/A,  which  is  an 

upper  bound  on  the  bag  size  in  stage  +  1.  Further,  all  bag  sizes  in  stages  before  and  after  +  1 
are  0.  Thus,  the  correct  integer  size  of  a  level  is  at  most  an  additive  of 


n  ■  (^)  '  ~  ib)  +  ^  ^  j 

V - V - ' 

=:  K4{u,A,r) 


bigger  than  the  ideal  size. 

Define $K(\nu, A, r) := \max\{K_1, K_2 + K_3\} + K_4$. Then it holds $|P_\ell| \le K \cdot n$ for every tree level $\ell$. An
AKS network can be embedded into a multibutterfly network with $m$ nodes per level if Equation 1, i.e.,
$(h + 2) \cdot |P_\ell| \le m$, is satisfied for every level $\ell$. Thus, the embedding is possible if we choose

$$n = \frac{m}{(h + 2) \cdot K}.$$

The size of the multibutterfly is $M = (\log_2 m + 1) \cdot m$, and the size of the AKS network is $N = h \cdot T \cdot n$. Thus,
$m \ge M/(\log_2 m + 1)$ and $n = N/(h \cdot T)$. In addition, $\log_2 m = \log_2 n + O(1)$ and $T \ge \log_2 n \cdot \log_{1/\nu}(2A) - O(1)$.
Thus, we have

$$N \ge \frac{M \cdot h \cdot T}{(h + 2) \cdot K \cdot (\log_2 m + 1)} \ge \frac{M \cdot h \cdot \left(\log_2 n \cdot \log_{1/\nu}(2A) - O(1)\right)}{(h + 2) \cdot K \cdot (\log_2 n + O(1))} = M/\kappa - o(M)$$

for $\kappa(\nu, A, r, h) := (1 + 2/h) \cdot K / \log_{1/\nu}(2A)$ (which is at most 1.462... for $\nu = 43/48$, $A = 3$, $r = 160$, and
$h = 36$ as suggested in [28]).
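The constant $\kappa$ can be evaluated once a value of $K$ is available. The following Python sketch assumes the definition $\kappa = (1 + 2/h) \cdot K / \log_{1/\nu}(2A)$ as read from the text. The value of $K$ has to be supplied by the caller, since its closed form is not computed here, and the function names are illustrative.

```python
import math


def log_base_inv_nu(x, nu):
    """log_{1/nu}(x) for 0 < nu < 1 and x > 0."""
    return math.log(x) / math.log(1 / nu)


def kappa(nu, A, h, K):
    """kappa = (1 + 2/h) * K / log_{1/nu}(2A); K is an input, not derived here."""
    return (1 + 2 / h) * K / log_base_inv_nu(2 * A, nu)
```

With the parameters $\nu = 43/48$ and $A = 3$, the denominator $\log_{1/\nu}(2A)$ is roughly 16.3, so $\kappa$ stays below 1.5 as long as $K$ is below about 23 for $h = 36$.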

This  completes  the  proof  of  Theorem  4.1. 


5 Routing h-relations on multibutterflies

In this section, we give a deterministic algorithm for routing $h$-relations on a multibutterfly with $(\alpha, \beta)$-expansion.
Given a $d$-dimensional multibutterfly, define $V_\ell$ to be the set of the $n = 2^d$ nodes on level $\ell$, for
$0 \le \ell \le d$. The nodes in $V_0$ are called input nodes, and the nodes in $V_d$ are called output nodes. Then an
$h$-relation is a set of tuples of input and output nodes $R \subseteq V_0 \times V_d$ such that each node $v_0 \in V_0$ and each
node $v_d \in V_d$ appears in at most $h$ of the tuples in $R$. Each tuple $(v_0, v_d) \in R$ represents a packet that
should be routed from an input node $v_0$ on level 0 to an output node $v_d$ on level $d$.

Each multibutterfly node can store only a constant number of packets. We assume that the multibutterfly
has $h - 1$ additional levels $-(h-1), \ldots, -1$ which model the initial storage for the at most $h \cdot n$ packets.
Let $V_\ell$ denote the set of nodes on level $\ell$, for $-(h-1) \le \ell \le -1$. We assume that each level $\ell$ with
$-(h-1) \le \ell \le -1$ is connected to level $\ell + 1$ by an $(\alpha, \beta)$-expander, i.e., for any $X \subseteq V_\ell$ with $|X| \le \alpha n$ it
holds $|\Gamma(X) \cap V_{\ell+1}| \ge \beta |X|$.

Upfal [35] presents an algorithm for routing a permutation, or 1-relation, in $O(\log n)$ steps. Our algorithm
uses Upfal's algorithm as a subroutine.

Upfal's algorithm. The algorithm routes a set of packets from the input to the output nodes of a multibutterfly
with $(\alpha, \beta)$-expansion with $\alpha > \epsilon$ and $\beta > 1 + \epsilon$, for constant $\epsilon > 0$.

The rough routing paths can be explained as follows: a packet stored in a splitter aims to move along an
edge of the left concentrator if its destination is in the submultibutterfly below the left half of the splitter,
and it aims to move along an edge of the right concentrator if its destination is in the submultibutterfly
below the right half of the splitter.

We assume that each input node stores $h$ packets that are partitioned into $L$ batches $B(0), \ldots, B(L-1)$,
such that no more than $\alpha m$ packets from each batch are routed through any splitter of size $m$ (!). The
indices of the batches are used as priority keys. A packet in batch $B(i)$ has higher priority than packets in
$\bigcup_{j > i} B(j)$. The edges of each splitter are colored with $2k$ colors so that no two edges of the same color are
adjacent to one node. The algorithm works in iterations. In odd iterations, the edges connecting odd levels
to even levels are activated. In even iterations, the edges connecting even levels to odd levels are activated.
Edges are activated one after the other according to the color order. Thus, in each step, only one edge
adjacent to each processor is activated. When an edge from node $(\ell, u)$ to node $(\ell+1, v)$ is activated, if node
$(\ell, u)$ stores in its buffer a packet with higher priority than the packet stored in the buffer of $(\ell+1, v)$, the
two nodes exchange packets. (An empty buffer is considered a packet with the lowest priority.) We extract
the following lemma from Upfal's analysis [35].

Lemma 5.1 Suppose the batches are chosen so that no more than $\alpha m$ packets from each batch are routed
through any splitter of size $m$. Then each packet has reached its destination in time $O(\log n + L)$.

For permutations, it is easy to split the packets into $O(1)$ batches that fulfill the above condition. As
a consequence, several permutations can be pipelined so that Upfal's algorithm takes time $O(\log n + h)$ for
routing $h$ permutations. Note that any $h$-relation can be split into $h$ disjoint permutations, but it is not
clear how to decompose an $h$-relation into $h$ disjoint permutations on the multibutterfly. Thus, the main
problem of routing $h$-relations is to split the packets into appropriate batches.


The new algorithm. Define $\kappa$ to be the smallest power of 2 with $\kappa \ge h/\alpha$, and define $p := d - \log \kappa$. For
$0 \le i \le 2^d/\kappa - 1$, define $M_i$ to be the $(\log \kappa)$-dimensional submultibutterfly with node set

$$\{(\ell, v) \mid p \le \ell \le d,\ \lfloor v/\kappa \rfloor = i\}.$$

Each $M_i$ has $\kappa$ inputs on level $p$ and $\kappa$ outputs on level $d$. $A_{p,i}$ is the input set of $M_i$. Figure 3 illustrates
the situation. Our algorithm works in three phases.

• Phase 1: Partition the packets into $L := 2\kappa$ batches $B(0), \ldots, B(L-1)$ such that $B(i)$ contains the
packets with destination nodes in the set $\{(d, v) \mid v \bmod L = i\}$.

Route the packets with Upfal's algorithm into the "correct" submultibutterfly whose inputs lie on level
$p$, i.e., route each packet with destination $(d, v)$ to an arbitrary node in $A_{p, \lfloor v/\kappa \rfloor}$.

For each node $(p, v)$ on level $p$, store all arriving packets in the column of $(p, v)$, i.e., at a node $(\ell, v)$
with $-(h-1) \le \ell \le d$, such that each node has to store at most a constant number of packets.

• Phase 2: Give each of the packets with the same destination a unique rank, i.e., for each $M_i$ and each
output node $(d, v)$ of $M_i$, number the packets with destination $(d, v)$ from 0 to $h - 1$.

• Phase 3: Partition the packets into $L := \kappa \cdot \lceil 2/\alpha \rceil$ batches $B(i + j \cdot \kappa) := B(i, j)$ with $0 \le i \le \kappa - 1$
and $0 \le j \le \lceil 2/\alpha \rceil - 1$. $B(i, j)$ contains each packet with rank $i$ and a destination in
$\{(d, v) \mid v \bmod \lceil 2/\alpha \rceil = j\}$.

Finally, complete the routing with Upfal's protocol according to the new batches.

Intuitively, we have split the $h$-relation in Phase 2 into $h$ disjoint relations $R_0, \ldots, R_{h-1}$ according to their
ranks so that all packets in $R_i$ have distinct destinations.
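The bookkeeping of the three phases can be sketched in a few lines of Python. The sketch covers only the choice of $\kappa$, the ranking of Phase 2, and the batch indices of Phases 1 and 3. It assumes $L = 2\kappa$ in Phase 1 and $L = \kappa \cdot \lceil 2/\alpha \rceil$ in Phase 3, as read from the description above, and it does not simulate the routing itself. All function names are illustrative.

```python
import math
from collections import defaultdict


def smallest_power_of_two_at_least(x):
    """Smallest power of 2 that is >= x, for x > 0."""
    k = 1
    while k < x:
        k *= 2
    return k


def phase1_batch(v, kappa):
    """Phase 1 batch index: destination column v modulo L = 2 * kappa."""
    return v % (2 * kappa)


def assign_ranks(destinations):
    """Phase 2: packets with equal destination get ranks 0, 1, 2, ... in input order."""
    counters = defaultdict(int)
    ranks = []
    for v in destinations:
        ranks.append(counters[v])
        counters[v] += 1
    return ranks


def phase3_batch(rank, v, kappa, alpha):
    """Phase 3 batch index i + j * kappa with i = rank and j = v mod ceil(2 / alpha)."""
    return rank + (v % math.ceil(2 / alpha)) * kappa
```

Because two packets with the same destination never share a rank, no Phase 3 batch contains two packets for the same output node, which is the property the final routing step relies on.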

Theorem 5.2 The above algorithm routes an arbitrary $h$-relation in time $O(\log n + h)$.

Proof: We have to prove that none of the splitters of size $m$ is traversed by more than $\alpha m$ packets. In
Phase 1, the number of packets from a batch $B(j)$ passing through the $i$-th splitter on level $\ell$ of size $m$ is at
most

$$h \cdot |\{v \mid v \bmod L = j,\ \lfloor v/m \rfloor = i\}| \le hm/L + h \le \alpha m/2 + h.$$

Since the packets route only through the levels 0 to $p - 1 = d - \log \kappa - 1$ in Phase 1, we have to consider only
splitters of size $m \ge 2\kappa$. Hence, $h \le \alpha\kappa \le \alpha m/2$, and thus the number of packets passing through a splitter
of size $m$ is at most $\alpha m/2 + h \le \alpha m$. In Phase 3, the number of packets passing through a splitter of size
$m$ is at most

$$m/\lceil 2/\alpha \rceil + 1 \le \alpha m,$$

for $m \ge 1/\alpha$. (We assume that splitters of size $m < 1/\alpha$ are completely connected.) As a consequence,
Phase 1 and Phase 3 can be done in time $O(\log n + L) = O(\log n + h)$.
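The Phase 1 counting bound can be verified by brute force for small parameters. The Python sketch below assumes $L = 2\kappa$ and blocks of $m$ consecutive columns per splitter, as in the proof. The function name and the sample parameters are illustrative.

```python
def batch_load(h, L, m, i, j):
    """h times the number of columns v in the i-th block of m columns with v mod L == j."""
    columns = range(i * m, (i + 1) * m)
    return h * sum(1 for v in columns if v % L == j)
```

For example, with $h = 3$, $\alpha = 1/2$, and hence $\kappa = 8$ and $L = 16$, every splitter of size $m \ge 2\kappa = 16$ carries at most $\alpha m$ packets of any single batch.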

Note that the bound on the routing time for Phase 1 also guarantees that all packets received by a node
on level $p$ can be stored in the respective column such that each node has to store a constant number of
packets. This is because each column consists of $h + \log n$ nodes, and each node on level $p$ can receive at
most one packet per time step.

Finally, we have to show how the ranks in Phase 2 can be computed efficiently. For each output node
$(d, u)$ in the submultibutterfly $M_i$, this can be done by a prefix computation. After this computation each
node $(p, v)$ on the input level of $M_i$ knows the number of packets with destination $(d, u)$ that are stored in the
columns with smaller indices than $v$, i.e., the number of packets stored at nodes $(\ell, w)$ with $w < v$. Thus,
node $(p, v)$ can compute a disjoint range of ranks for the packets stored in its column.

$\kappa$ prefix computations can be performed in time $O(\kappa)$ on each of the submultibutterflies. Thus, the ranks
can be computed and distributed among the packets in the columns in time $O(h + \log n)$, which completes
our proof. □
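The rank assignment by prefix computation can be illustrated sequentially. The Python sketch below assumes that `counts[v]` holds the number of packets for one fixed destination stored in column `v` of a submultibutterfly. It computes an exclusive prefix sum and the resulting rank range per column. The function name and the half-open range convention are illustrative choices.

```python
from itertools import accumulate


def rank_ranges(counts):
    """Return half-open rank ranges [start, end) per column via an exclusive prefix sum.

    counts[v] is the number of packets for one destination stored in column v.
    """
    starts = [0] + list(accumulate(counts))[:-1]
    return [(s, s + c) for s, c in zip(starts, counts)]
```

For instance, `rank_ranges([2, 0, 1, 3])` gives column 0 the ranks 0 and 1, column 2 the rank 2, and column 3 the ranks 3 to 5, so the ranges of different columns never overlap.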

In the above algorithm, we have assumed that the value of $\kappa$ is known in advance. In order to avoid this,
the algorithm can double the value of $\kappa$ beginning with some $\kappa \ge \log n$ and test for each $\kappa$ if Phase 1 can be
completed in the time stated above. Note that this increases the routing time by a factor of at most 2.

A more practical solution. When $h$ is known in advance, and $h = O(\log n)$, another practical solution
is to replace the $\kappa$-input submultibutterflies with $\kappa \times \kappa$ meshes of trees.

A $\kappa \times \kappa$ mesh of trees consists of an array of nodes with $\kappa$ rows and $\kappa$ columns. The nodes in each row
serve as the leaves of a complete binary tree called a row tree, and the nodes in each column serve as the
leaves in a column tree. Hence, node $(i, j)$ in the array serves as both the $i$th leaf in the $j$th column tree,
and the $j$th leaf in the $i$th row tree. An $h$-relation can be routed between the roots of the column trees and
the roots of the row trees in $O(h + \log \kappa)$ steps by simply routing each packet down its column tree to the
appropriate row, and then up through the row tree to its root.

In our application, the roots of the column trees in a mesh of trees replace the inputs of a $\kappa$-input
submultibutterfly, and the roots of the row trees replace its outputs. A $\kappa \times \kappa$ mesh of trees has $3\kappa^2 - 2\kappa =
O(\kappa^2)$ nodes. Since there are $n/\kappa$ meshes of trees, they contain a total of $O(n \cdot \kappa)$ nodes. For $h = O(\log n)$ (and
hence $\kappa = O(\log n)$), this total is $O(n \log n)$, the same as the number of nodes in an $n$-input multibutterfly.
Thus, replacing the submultibutterflies by the meshes of trees does not increase the asymptotic number of
nodes. Also, the VLSI layout area of a $\kappa \times \kappa$ mesh of trees is $O(\kappa^2 \log^2 \kappa)$. Since there are $n/\kappa$ of them, their
total VLSI layout area is $O(n \cdot \kappa \log^2 \kappa)$. Since the layout area of the multibutterflies is $O(n^2)$, replacing the
submultibutterflies with the meshes of trees does not increase the asymptotic VLSI layout area.
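The node count of a mesh of trees follows directly from its structure, as the following Python sketch shows. It assumes that $\kappa$ is a power of 2, so that every row tree and column tree is a complete binary tree with $\kappa$ leaves and $\kappa - 1$ internal nodes. The function name is illustrative.

```python
def mesh_of_trees_nodes(k):
    """Number of nodes of a k x k mesh of trees, for k a power of 2.

    The k*k array nodes are the leaves; each of the k row trees and
    k column trees adds k - 1 internal nodes.
    """
    leaves = k * k
    internal_per_tree = k - 1
    return leaves + 2 * k * internal_per_tree
```

Summing over the $n/\kappa$ meshes gives fewer than $3 n \kappa$ nodes in total, in line with the $O(n \cdot \kappa)$ bound above.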


6  Simulating  expansion  on  a  2-folded  butterfly 

The set of edges in a concentrator of a 2-folded butterfly can be split into disjoint subsets such that the
edges in each of these subsets form a cycle. As a consequence, if we consider, e.g., the left concentrator on
level 0, there exists an arbitrarily large subset $X \subseteq A_{0,0}$ with $|\Gamma(X) \cap A_{1,0}| \le |X| + 1$. This means that a
2-folded butterfly has poor expansion properties. However, the following theorem shows that the effective
expansion can be improved by simulating multibutterflies with higher degree.

Theorem 6.1 For any $\beta < 1/(4\alpha)$, there exists a 2-folded butterfly $A$ that can simulate with constant slowdown
a multibutterfly $B$ of the same size that has $(\alpha, \beta)$-expansion.

Proof: We describe a $d$-dimensional 2-folded butterfly $A$ and an equal-sized multibutterfly $B$ of degree
$4k$ such that $A$ can simulate $B$ with constant slowdown. $A$ and $B$ will be constructed randomly, and we
will prove that the probability that $B$ has $(\alpha, \beta)$-expansion is bigger than 0. This proves that there exists a
multibutterfly $B$ with appropriate expansion that can be simulated on a 2-folded butterfly $A$.

Consider the first $k$ levels of the 2-folded butterfly $A$. We define these levels by describing the underlying
butterfly networks $BF_1$ and $BF_2$, i.e., the two butterflies from which $A$ can be constructed. We assume that
$BF_1$ has the "usual" butterfly node labels, i.e., the edges of $BF_1$ connect a node $(\ell, v_0, \ldots, v_{d-1})$ on level $\ell$
to the nodes $(\ell+1, v_0, \ldots, v_{\ell-1}, 0, v_{\ell+1}, \ldots, v_{d-1})$ and $(\ell+1, v_0, \ldots, v_{\ell-1}, 1, v_{\ell+1}, \ldots, v_{d-1})$ on level $\ell + 1$.

BF2  is  defined  randomly.  For  any  1  <  i  <  k  and  x  G  {0, 1}^,  suppose  <j)k^x  is  a  permutation  cho¬ 
sen  randomly  and  uniformly  from  the  set  of  permutations  on  {0,1}^““^.  Then  each  node  [t,v)  with 
V  =  G  {0, 1}^  is  connected  by  a  BF2-edge  to  node 

if  Ij  *^0)  •  •  •  )  — 1 )  j  •  •  • }  1))  • 

Intuitively, this edge flips randomly the last $d - k$ bits of the node labels. (The second $BF_2$-edge of the node
which leads to level $\ell + 1$ can be chosen arbitrarily.)

Next we define the first $k$ levels of multibutterfly $B$ with degree $2k$. Consider level $\ell$ of $B$ with $0 \le \ell \le k - 1$.
Suppose  TTi^x  is  a  permutation  chosen  randomly  and  uniformly  from  the  set  of  permutations  on  {0, 1}^"^, 
1  ^  ^  ^  X  G  {0, Let  {i,v)  be  a  node  on  level  i  with  v  =  (t^Oj  •  •  •  j  G  {0,1}^. 

Define  x  :=  y  :=  . . . ,  and  2:  :=  .  yVd-i-  Further,  define  y^  :=  7r,‘,(a;^o,2)(2/), 

and  Zi  :=  1  <  i  <  k.  Intuitively,  the  ;r-permutations  switch  randomly  the  y-bits  and  the 

(^^-permutations  switch  randomly  the  2r-bits.  We  connect  {i,v)  =  {i,  {x,  {0,  l},y,z))  with  2k  nodes  on  level 
£-\-ly  i.e.  with  the  nodes 

{i3rl,{x,^,y[,z^i))  and  (^  +  1,  (x,  1,  2r0)  , 

for $1 \le i \le k$. It is easy to check that all edges are inside the splitters and that each node on level $\ell + 1$ is
the endpoint of $2k$ edges. Thus, $B$ is a multibutterfly with degree $4k$. Note that all edges in a concentrator
on level $\ell$ are chosen independently (except that some of them are not allowed to end at nodes on level $\ell + 1$
which have the same $y$- and $z$-bits).

The 2-folded butterfly $A$ can simulate $B$ with constant slowdown since there is a path in $A$ of length at
most $2k + 1$ from $(\ell, v)$ to any adjacent node on level $\ell + 1$. For $1 \le i \le k$ and $b, b' \in \{0, 1\}$, this path can
be constructed as follows


=r 

(£,  {x,b,y,z)) 

BFi 

(fc,  (a;,  6',  24,2)) 

BFi 

{i+l,{x,h',y'i,z)) 

{i,{x,b',y[,z'0) 

BFi 

(f+  l,(x,b',y'i,z'i) 

We now investigate the expansion of $B$. Consider one of the concentrators in the first $k$ levels. It consists
of a node set $A := A_{\ell,i}$ and a node set $B := A_{\ell+1,2i(+1)}$ with $0 \le \ell \le k - 1$ and $0 \le i \le 2^\ell - 1$. Define
$m := |A|$. The probability that all edges in this concentrator that are incident to nodes in a subset $X \subseteq A$
have  their  endpoints  in  a  subset  y  C  5  is  at  most  .  As  a  consequence,  the  probability  that  the 

concentrator  has  no  (a,  ^)-expansion  is  at  most 


< 


< 


[of-mj 

E  E  E 

fi  —  l  XCA  YCB 

|X|=M  |V|=l/3.vJ 


[a-mj 
/Z  =  l 


We choose $k \ge (\beta \cdot \ln(1/(2\alpha\beta)) + \ln(4/\alpha) + 2) / \log(1/(4\alpha\beta))$. Then the above term that bounds the probability
of  a  bad  event  in  one  concentrator  by  2''~*.  Thus,  the  probability  that  all  2*+^  concentrators  have 
expansion on the first $k$ levels is greater than 0. Consequently, we can choose the edges of $A$ so that $A$ can
simulate the first $k$ levels of a multibutterfly with $(\alpha, \beta)$-expansion. The levels $k$ to $d - 1$ of $A$ can be
viewed as $2^k$ independent 2-folded butterflies of dimension $d - k$. Applying the above scheme recursively to
these butterflies completes our proof. □


7  Open  problems 

We  conclude  with  a  few  open  problems. 

1. Can an $N$-node multibutterfly whose splitters have an $(\alpha, \beta)$-expansion property be embedded with
constant load, congestion, and dilation, in an $O(N)$-node AKS network whose $\epsilon$-halvers have an $(\alpha, \beta)$
(or better) expansion property?

2. What is the complexity of selecting the $k$th largest item from among $M$ items on an $N$-node bounded-degree
network for $\omega(1) \le M/N \le o(\log N \log\log(M/N))$?